This study, using data from IBGE (the Brazilian Institute of Geography and Statistics) for 1991, 2000 and 2010, sought to create metrics for suggesting promising municipalities in which to open a new retail enterprise. The approach was based on the premise that certain factors are decisive for the success of a retail business, while others have little or no relevance. The study did not seek to prove this statement; rather, it imagines a scenario in which the premise holds, allowing a location for a new business to be suggested. In the absence of external criteria on which to base the “success” of an establishment, the metrics of population and per capita income were taken as decisive. The analysis integrated time series (ARIMA) and neural networks (Keras), models that start from different assumptions, implying, in a certain sense, an “agreement” between their predictions. While ARIMA projected data for the year 2020, Keras applied a model learned on the years 1991, 2000 and 2010 to the data projected by ARIMA. Finally, the “concordant” data between the models was selected, and the three best cities were chosen based on filters and two investor profiles: one seeks a high profit margin at the expense of sales volume, the other a high sales volume at the expense of profit margin.
Imagine a scenario in which a retailer is interested in opening a new business in a Brazilian municipality, using IBGE data from 1991, 2000 and 2010. Which municipality could be suggested, and why?
This is a primary study whose scope is limited to IBGE data. It seeks to make predictions for the year 2020 (ARIMA) and to “validate” them with Keras.
Its general goals were:
Its specific aims were to:
ARIMA was chosen because it is especially effective at modeling trends and linear temporal patterns in data.
Keras, on the other hand, was chosen for its ability to deal with complex and non-linear relationships, mainly between different characteristics or variables in the data set.
In other words, the objective is the same, but it is reached by different paths, which makes it possible to assess an “agreement” between the predictions of the two models. Used in combination, ARIMA and Keras allow a mutual (in)validation of the predictions to be inferred.
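The “agreement” idea used throughout the study can be sketched as a simple ratio filter. The vectors below are made-up illustrations, not the study's data:

```r
# Illustrative sketch of the "agreement" filter between two models.
# pred_arima and pred_keras are hypothetical predictions, not IBGE data.
pred_arima <- c(100, 250, 400, 800)
pred_keras <- c(110, 180, 390, 1000)

ratio <- pred_arima / pred_keras   # 1 means perfect agreement

# Keep only cases where the models agree within 20% of each other
agree <- ratio >= 0.8 & ratio <= 1.2
print(agree)   # TRUE FALSE TRUE TRUE
```

A ratio of exactly 1 would mean the two models predicted the same value; the 0.8 and 1.2 bounds tolerate a 20% deviation in either direction.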
The variables are, respectively…:
# Separando por anos
df_1991 <- data.frame(df_raw[df_raw$ANO == 1991,])
df_2000 <- data.frame(df_raw[df_raw$ANO == 2000,])
df_2010 <- data.frame(df_raw[df_raw$ANO == 2010,])
Checking if the data frames are equivalent:
# Checando se os dataframes têm entradas idênticas (Em particular os municípios).
## Checando se colunas têm o mesmo tamanho
if (all(length(df_1991$NOMEMUN) == length(df_2000$NOMEMUN), length(df_1991$NOMEMUN) == length(df_2010$NOMEMUN))) {
## Checando se colunas têm o mesmo valor (municípios)
if (identical(sort(df_1991$NOMEMUN), sort(df_2000$NOMEMUN)) &&
identical(sort(df_2000$NOMEMUN), sort(df_2010$NOMEMUN)))
{cat("Valores iguais.")
} else {cat("Alguns valores diferentes.")
}} else {cat("Tamanhos diferentes")}
## Valores iguais.
if (identical(df_1991$NOMEMUN, df_2000$NOMEMUN) && identical(df_2000$NOMEMUN, df_2010$NOMEMUN)) {cat("Nomes e posições exatamente iguais.")}
## Nomes e posições exatamente iguais.
Deleting currently irrelevant data.
# Selecionando colunas de interesse e isolando (Por index do município)
df_1991_ByMdx <- data.frame(df_1991[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])
df_2000_ByMdx <- data.frame(df_2000[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])
df_2010_ByMdx <- data.frame(df_2010[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])
consolidado <- data.frame(rbind(df_1991_ByMdx,
df_2000_ByMdx,
df_2010_ByMdx))
# Atribuindo para objeto sem Index do Município
df_keras <- data.frame(consolidado[,-1])
df_ranked <- data.frame(df_keras[,1:7])
df_keras
Preliminary correlation analysis for the years 1991, 2000 and 2010:
# Selecionando variáveis
cor_consolidado <- cor(consolidado[, c("AREA_KM2","KM_DIST_CAP",
"ESPVIDA","FECTOT","T_ENV","GINI",
"PIND","RDPC","PESO15","PESOTOT",
"DEM_DEMO","TXDSEMP")])
# Arredondando resultado
cor_consolidado <- round(cor_consolidado,2)
# Heatmap
G_HM_cor_cons <- plot_ly(
type = "heatmap",
colorscale = "Portland",
z = cor_consolidado,
x = colnames(cor_consolidado),
y = rownames(cor_consolidado),
zmin = -1,
zmax = 1,
reversescale = TRUE
) %>% layout(
title = "Matriz de Correlação",
font = list(color = '#FFFFFF'),
paper_bgcolor = "#222222",
showlegend = FALSE
)
# Exibindo anotações no gráfico onde valor diferente de 1.
for (nr in 1:nrow(cor_consolidado)) {
for (nc in 1:ncol(cor_consolidado)) {
if (cor_consolidado[nr, nc] != 1) {
G_HM_cor_cons <- G_HM_cor_cons %>%
add_annotations(
text = round(cor_consolidado[nr, nc], 6),
x = colnames(cor_consolidado)[nc],
y = rownames(cor_consolidado)[nr],
showarrow = FALSE,
font = list(size = 14, color = "white"))}}}
G_HM_cor_cons
Although the data above is rich in socioeconomic information, this study selected RDPC, PESOTOT and DEM_DEMO.
The code below takes each municipality individually in the years 1991, 2000 and 2010 and, based on this linear chronological organization, projects its data for 2020.
# Indexando ano
df_dec <- cbind(df_raw[2], df_keras)
# Removendo Areas
arima_df <- df_dec[, c(1,4,5,6,7,8,9,10,11,12,13)]
# Separando as décadas
arima_1991 <- data.frame(arima_df[arima_df$ANO == 1991,])
arima_2000 <- data.frame(arima_df[arima_df$ANO == 2000,])
arima_2010 <- data.frame(arima_df[arima_df$ANO == 2010,])
# Reset para debug
arima_2020 <- arima_2010[1,] # Adiciona linha e header
arima_2020 <- arima_2020[-nrow(arima_2020),] # Remove linha e mantém header
A_arima_2020 <- arima_2010[1,] # Adiciona linha e header
A_arima_2020 <- A_arima_2020[-nrow(A_arima_2020),] # Remove linha e mantém header
# Para cada elemento em cada linha do df_1991_1:
for (k in seq_len(nrow(arima_1991))) {
# Seleciona linhas
df_temp_forecast <- arima_1991[k,]
df_temp_forecast <- rbind(df_temp_forecast, arima_2000[k,])
df_temp_forecast <- rbind(df_temp_forecast, arima_2010[k,])
arima_2020 <- arima_2010[k,] # Apenas pela estrutura, será sobrescrito.
arima_2020$ANO[1] <- 2020
# Colunas para iteração
cols_var <- c("ESPVIDA","FECTOT", "T_ENV","GINI","PIND",
"RDPC","PESO15","PESOTOT","DEM_DEMO","TXDSEMP")
for (i in cols_var) {
arima_2020[[i]][1] <- mean(df_temp_forecast[[i]]) # 0 ou media
temp_col <- df_temp_forecast[[i]] # seleção de coluna predição
try({ temp_line <- arima(temp_col, order = c(0,0,1))},silent = TRUE)
forecast_2020 <- forecast(temp_line, h = 1) # h = 1 (2020)
arima_2020[[i]][1] <- forecast_2020$mean[1]} # Resultado a se guardar
# Joga resultado para df_2020_1
A_arima_2020 <- rbind(A_arima_2020, arima_2020[1, ])
arima_2020 <- arima_2020[-nrow(arima_2020),]
# Debug para acompanhamento
if (k %% 500 == 0) {print(k)}
else if (k == 5565) {print("Concluído")}}
## [1] 500
## [1] 1000
## [1] 1500
## [1] 2000
## [1] 2500
## [1] 3000
## [1] 3500
## [1] 4000
## [1] 4500
## [1] 5000
## [1] 5500
## [1] "Concluído"
# Exportando para csv
write.csv(A_arima_2020, "Arima_IBGE_2020", row.names = FALSE)
write.csv(A_arima_2020, "Arima_IBGE_2020_RowName", row.names = TRUE)
With ARIMA's projections for 2020 in hand, the values are consolidated so that they can then be used by Keras.
Arima_Prdct <- data.frame(Arima_Bkp)
# Padronizando data frames
# Inserindo variáveis
Arima_Prdct$AREA_KM2 <- df_keras$AREA_KM2[1:5565]
Arima_Prdct$KM_DIST_CAP <- df_keras$KM_DIST_CAP[1:5565]
# Reordena e não seleciona $ANO de 2020
Arima_Prdct <- Arima_Prdct %>%
select(c(
"AREA_KM2", "KM_DIST_CAP","ESPVIDA", "FECTOT", "T_ENV",
"GINI","PIND","RDPC","PESO15","PESOTOT","DEM_DEMO","TXDSEMP"
))
Having projected data for 2020 with ARIMA (using the years 1991, 2000 and 2010), a model was then trained on those historical years for later application to the 2020 projections. That is:
# Selecionando target
x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
"GINI", "PIND","PESO15","PESOTOT","DEM_DEMO","TXDSEMP")]
y <- df_keras$RDPC
# Normalização dos dados
x <- scale(x)
# Definindo o modelo DNN
model_keras <- keras_model_sequential()
model_keras %>%
layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 1)
# Compilando modelo
model_keras %>% compile(
optimizer = optimizer_rmsprop(),
loss = 'mse',
metrics = c('mae')
)
# Treinamento
history <- model_keras %>% fit(
x, y,
epochs = 100,
validation_split = 0.2, # Split treinamento/validação
verbose = 0 # Ocultar Epoch's
)
# Desempenho do modelo
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 3907.9136 - mae: 34.8890 - 562ms/epoch - 1ms/step
## [1] "Mean Absolute Error (MAE): 34.888973236084"
## [1] "Mean Squared Error (MSE): 3907.91357421875"
## loss mae
## 3907.91357 34.88897
# Análise Comparativa
# Predições realizadas (Para ranking)
predicoes_RDPC <- model_keras %>% predict(x)
## 522/522 - 1s - 556ms/epoch - 1ms/step
# Df pred/real
comparativo_RDPC <- data.frame(Previsoes = predicoes_RDPC, Real = y)
# Dispersão
plot(comparativo_RDPC$Real, comparativo_RDPC$Previsoes,
xlab = c("RDPC Real", metrics),
ylab = "Previsões do Modelo",
main = "Comparação entre Previsões e Valores Reais")
# Keras RDPC (Ranking dos Municípios)
df_ranked$Pred_RDPC_Rnk <- predicoes_RDPC
df_ranked$PESO15 <- df_keras$PESO15
summary(predicoes_RDPC)
## V1
## Min. : 36.38
## 1st Qu.: 182.19
## Median : 299.39
## Mean : 340.94
## 3rd Qu.: 469.31
## Max. :1604.38
# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
"FECTOT", "T_ENV", "GINI","PIND","PESO15",
"PESOTOT","DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)
# Previsões
predicoes_RDPC_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)
## 174/174 - 0s - 170ms/epoch - 974us/step
# Selecionando target
x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
"GINI", "PIND","RDPC","PESO15","DEM_DEMO","TXDSEMP")]
y <- df_keras$PESOTOT
# Normalização dos dados
x <- scale(x)
# Definindo o modelo DNN
model_keras <- keras_model_sequential()
model_keras %>%
layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 1)
# Compilando modelo
model_keras %>% compile(
optimizer = optimizer_rmsprop(),
loss = 'mse',
metrics = c('mae')
)
# Treinamento
history <- model_keras %>% fit(
x, y,
epochs = 100,
validation_split = 0.2, # Split treinamento/validação
verbose = 0
)
# Desempenho do modelo
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 217377904.0000 - mae: 4604.7192 - 563ms/epoch - 1ms/step
## [1] "Mean Absolute Error (MAE): 4604.71923828125"
## [1] "Mean Squared Error (MSE): 217377904"
## loss mae
## 2.173779e+08 4.604719e+03
# Análise Comparativa
# Predições realizadas (Para ranking)
predicoes_PESOTOT <- model_keras %>% predict(x)
## 522/522 - 1s - 526ms/epoch - 1ms/step
# Df pred/real
comparativo_PESOTOT <- data.frame(Previsoes = predicoes_PESOTOT, Real = y)
# Dispersão
plot(comparativo_PESOTOT$Real, comparativo_PESOTOT$Previsoes,
xlab = c("PESOTOT Real", metrics),
ylab = "Previsões do Modelo",
main = "Comparação entre Previsões e Valores Reais")
# Keras PESOTOT (Ranking dos Municípios)
df_ranked$Pred_PESOTOT_Rnk <- predicoes_PESOTOT
summary(predicoes_PESOTOT)
## V1
## Min. : 1845
## 1st Qu.: 8557
## Median : 11977
## Mean : 31326
## 3rd Qu.: 19560
## Max. :12438205
# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
"FECTOT", "T_ENV", "GINI","PIND",
"RDPC","PESO15", "DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)
# Previsões
predicoes_PESOTOT_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)## 174/174 - 0s - 220ms/epoch - 1ms/step
x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
"GINI", "PIND","RDPC","PESO15","PESOTOT","TXDSEMP")]
y <- df_keras$DEM_DEMO
x <- scale(x)
model_keras <- keras_model_sequential()
model_keras %>%
layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 1)
model_keras %>% compile(
optimizer = optimizer_rmsprop(),
loss = 'mse',
metrics = c('mae')
)
history <- model_keras %>% fit(
x, y,
epochs = 100,
validation_split = 0.2, # Split treinamento/validação
verbose = 0
)
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 117868.3281 - mae: 67.4136 - 824ms/epoch - 2ms/step
## [1] "Mean Absolute Error (MAE): 67.4136276245117"
## [1] "Mean Squared Error (MSE): 117868.328125"
## loss mae
## 117868.32812 67.41363
predicoes_DEM_DEMO <- model_keras %>% predict(x)
## 522/522 - 1s - 648ms/epoch - 1ms/step
comparativo_DEM_DEMO <- data.frame(Previsoes = predicoes_DEM_DEMO, Real = y)
plot(comparativo_DEM_DEMO$Real, comparativo_DEM_DEMO$Previsoes,
xlab = c("DEM_DEMO Real", metrics),
ylab = "Previsões do Modelo",
main = "Comparação entre Previsões e Valores Reais")
# Keras DEM_DEMO (Ranking dos Municípios)
df_ranked$Pred_DEM_DEMO_Rnk <- predicoes_DEM_DEMO
df_ranked$TXDSEMP <- df_keras$TXDSEMP
summary(predicoes_DEM_DEMO)
## V1
## Min. : -341.524
## 1st Qu.: 9.402
## Median : 26.185
## Mean : 97.150
## 3rd Qu.: 65.372
## Max. :15828.087
# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
"FECTOT", "T_ENV", "GINI","PIND","PESO15",
"PESOTOT","DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)
# Previsões
predicoes_DEM_DEMO_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)
## 174/174 - 0s - 212ms/epoch - 1ms/step
Comparison of the values predicted by ARIMA and by Keras for 2020:
Organizing Data Frames:
# Predições de ARIMA para 2020.
A_ARIMA_20 <- round(Arima_Prdct[,c(8,10,11)],2)
colnames(A_ARIMA_20)[1] <- "RDPC"
colnames(A_ARIMA_20)[2] <- "PESOTOT"
colnames(A_ARIMA_20)[3] <- "DEM_DEMO"
# Predições de Keras para 2020.
A_KERAS_20 <- round(Arima_Prdct[, c(13,14,15)],2)
colnames(A_KERAS_20)[1] <- "RDPC"
colnames(A_KERAS_20)[2] <- "PESOTOT"
colnames(A_KERAS_20)[3] <- "DEM_DEMO"
The filters select the municipalities where the predictions of the two models agree within a 20% variation, that is, a ratio between 0.8 and 1.2, where 1 indicates perfect agreement. Below, the sample of candidate municipalities is shown graphically; the two red lines mark the bounds of the applied filter.
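The ratio table `Ar_Vs_Kr` used in the histograms below is not shown in this excerpt; presumably it is the element-wise ratio of the ARIMA and Keras prediction tables. A minimal sketch, with stand-in data frames built purely for illustration:

```r
# Assumed construction of Ar_Vs_Kr: element-wise ratio of the
# ARIMA and Keras prediction tables (1 = perfect agreement).
# These A_ARIMA_20 / A_KERAS_20 values are illustrative stand-ins,
# not the study's actual predictions.
A_ARIMA_20 <- data.frame(RDPC = c(300, 500), PESOTOT = c(10000, 20000),
                         DEM_DEMO = c(25, 80))
A_KERAS_20 <- data.frame(RDPC = c(310, 450), PESOTOT = c(11000, 26000),
                         DEM_DEMO = c(24, 70))

# Element-wise division of numeric data frames, rounded to 2 decimals
Ar_Vs_Kr <- round(A_ARIMA_20 / A_KERAS_20, 2)
print(Ar_Vs_Kr)
```

Under this assumption, a value near 1 in any column means the two models made similar predictions for that municipality and variable.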
hist(Ar_Vs_Kr$RDPC,
breaks = 200,
xlim = c(0, 2),
ylab = "Frequência",
xlab = "Variação",
main = "Histograma de renda per capita em 2020: Arima VS Keras")
abline(v = 0.8, col = "red", lty = 2)
abline(v = 1.2, col = "red", lty = 2)
Consolidating the data frames:
# RENOMEANDO KERAS
An_KERAS_20 <- data.frame(A_KERAS_20)
An_ARIMA_20 <- data.frame(A_ARIMA_20)
An_Mean_20 <- (An_ARIMA_20 + An_KERAS_20) / 2
colnames(An_Mean_20)[1] <- "AxK_Mean_R"
colnames(An_Mean_20)[2] <- "AxK_Mean_P"
colnames(An_Mean_20)[3] <- "AxK_Mean_D"
An_Mean_20 <- round(An_Mean_20,2)
## FILTRO
# Identificação
decisao <- df_2010[, c(1,3,4,5)]
# Ar_Vs_Kr: Filtro
# AxK_Mean: Estável
decisao <- cbind(decisao, Ar_Vs_Kr)
decisao <- cbind(decisao, An_Mean_20)
Filter: select the municipalities where the ARIMA and Keras predictions differ by at most 20%, that is, a ratio between 0.8 and 1.2, where 1 means exact agreement.
# Considerar apenas os dados que estão entre o primeiro e terceiro quartil
paste("There are",nrow(decisao),"potential municipalities. The filters are 'safety margins' within which the predictions of the two models agree: municipalities whose ARIMA/Keras prediction ratio lies between 0.8 and 1.2, with 1 being exact agreement and a tolerated variation of 20% above or below.")
## [1] "There are 5565 potential municipalities. The filters are 'safety margins' within which the predictions of the two models agree: municipalities whose ARIMA/Keras prediction ratio lies between 0.8 and 1.2, with 1 being exact agreement and a tolerated variation of 20% above or below."
filtro1 <- subset(decisao, RDPC >= 0.8 & RDPC <= 1.2)
paste("Applying the first filter leaves:",nrow(filtro1),"potential places.",nrow(decisao) - nrow(filtro1),"municipalities removed.")
## [1] "Applying the first filter leaves: 3417 potential places. 2148 municipalities removed."
filtro2 <- subset(filtro1, PESOTOT >= 0.8 & PESOTOT <= 1.2)
paste("Applying the second filter leaves:",nrow(filtro2),"potential places.",nrow(filtro1) - nrow(filtro2),"municipalities removed.")
## [1] "Applying the second filter leaves: 1397 potential places. 2020 municipalities removed."
filtro3 <- subset(filtro2, DEM_DEMO >= 0.8 & DEM_DEMO <= 1.2)
paste("Applying the third filter leaves:",nrow(filtro3),"potential places.",nrow(filtro2) - nrow(filtro3),"municipalities removed.")
## [1] "Applying the third filter leaves: 222 potential places. 1175 municipalities removed."
When the filters are applied, the 5565 candidates are reduced to 222 municipalities in which Keras and ARIMA agree within a 20% margin.
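The filter counts reported above can be double-checked with simple arithmetic:

```r
# Sanity check on the filter funnel: counts taken from the outputs above.
counts <- c(start = 5565, filter1 = 3417, filter2 = 1397, filter3 = 222)

# Municipalities removed at each step (pairwise differences)
removed <- -diff(counts)
print(removed)   # 2148, 2020, 1175
```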
From these results, two profiles were hypothesized as suggestions:
For retailers whose sales profile relies on a high profit margin (taking into account a population with high purchasing power).
The approach presented was based on the premise that certain factors are decisive for the success of a retail venture, while others have little or no relevance. The study did not seek to prove this statement; instead, it imagines a scenario in which the premise holds, making it possible to measure variables, establish better markers for the success of such a business and, finally, indicate a location for future studies to identify promising municipalities for opening a new retail business.
The data considered relevant were: territorial extension, distance from the capital, life expectancy, total fertility rate, aging rate, Gini index, economic-vulnerability population index, population aged 15 or over, and unemployment rate. Among these, the following variables were emphasized: per capita income, total resident population and demographic density.
The analysis was carried out with models that start from different points and make different assumptions about the same goal, aiming to provide more robust predictions and implying, in a way, an “agreement” between what the models predicted. Filters were then created as “safety margins” for this “agreement”, in other words selecting “the municipalities that had related forecasts”. The safety margin was 20%. The first filter is per capita income, the second total resident population and the third demographic density.
Starting from 5565 potential municipalities, the first filter left 3417 candidates, the second 1397 and, finally, 222 potential municipalities remained.
It was concluded that two profiles could be preliminarily suggested: one seeking few sales with a high profit margin, and another seeking a high recurrence of sales. The three municipalities selected for each profile were:
1:
… for high-income profile.
2:
… for the high-“turnover” profile.